The goal of the project

Based on the data about user and expert reviews, genres, platforms, and historical data on game sales, identify patterns that determine whether the game succesful or not to plan the advertising campaings for the online store Ice.

Step 1. Open the data file and study the general information

In [1]:
In [2]:
In [3]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
Name               16713 non-null object
Platform           16715 non-null object
Year_of_Release    16446 non-null float64
Genre              16713 non-null object
NA_sales           16715 non-null float64
EU_sales           16715 non-null float64
JP_sales           16715 non-null float64
Other_sales        16715 non-null float64
Critic_Score       8137 non-null float64
User_Score         10014 non-null object
Rating             9949 non-null object
dtypes: float64(6), object(5)
memory usage: 1.4+ MB
In [4]:
Out[4]:
Name Platform Year_of_Release Genre NA_sales EU_sales JP_sales Other_sales Critic_Score User_Score Rating
0 Wii Sports Wii 2006.0 Sports 41.36 28.96 3.77 8.45 76.0 8 E
1 Super Mario Bros. NES 1985.0 Platform 29.08 3.58 6.81 0.77 NaN NaN NaN
2 Mario Kart Wii Wii 2008.0 Racing 15.68 12.76 3.79 3.29 82.0 8.3 E
3 Wii Sports Resort Wii 2009.0 Sports 15.61 10.93 3.28 2.95 80.0 8 E
4 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing 11.27 8.89 10.22 1.00 NaN NaN NaN
In [5]:
Name Platform Genre User_Score Rating
count 16713 16715 16713 10014 9949
unique 11559 31 12 96 8
top Need for Speed: Most Wanted PS2 Action tbd E
freq 12 2161 3369 2424 3990
Year_of_Release NA_sales EU_sales JP_sales Other_sales Critic_Score
count 16446.000000 16715.000000 16715.000000 16715.000000 16715.000000 8137.000000
mean 2006.484616 0.263377 0.145060 0.077617 0.047342 68.967679
std 5.877050 0.813604 0.503339 0.308853 0.186731 13.938165
min 1980.000000 0.000000 0.000000 0.000000 0.000000 13.000000
25% 2003.000000 0.000000 0.000000 0.000000 0.000000 60.000000
50% 2007.000000 0.080000 0.020000 0.000000 0.010000 71.000000
75% 2010.000000 0.240000 0.110000 0.040000 0.030000 79.000000
max 2016.000000 41.360000 28.960000 10.220000 10.570000 98.000000

Conclusion

At this stage, the data file on game sales for 2016 was downloaded and stored into the DataFrame. From the received general information about the data. The DataFrame consists of 11 columns (of which 5 object type and 6 - float type) and 16715 rows. I can conclude that the tables look good, the missing values are in 6 columns, which we will deal with in the next step. The table contains data on games of 12 different genres, the number of unique values in the platform column - 31. The name column contains 11559 unique names, I find it necessary to check it for duplicates.

Step 2. Prepare the data

2.1. Renayming columns, dropping missing values and converting the data to the required types

In [6]:
name                  2
platform              0
year_of_release     269
genre                 2
na_sales              0
eu_sales              0
jp_sales              0
other_sales           0
critic_score       8578
user_score         6701
rating             6766
dtype: int64
In [7]:
In [8]:
Out[8]:
16444

2.2. Add column 'total sales'

In [9]:
Out[9]:
count    16444.000000
mean         0.536023
std          1.558786
min          0.000000
25%          0.060000
50%          0.170000
75%          0.470000
max         82.540000
Name: total_sales, dtype: float64

2.3. Fill in missing values

In [10]:
name platform year_of_release genre na_sales eu_sales jp_sales other_sales critic_score user_score rating total_sales
1 Super Mario Bros. NES 1985 Platform 29.08 3.58 6.81 0.77 NaN NaN NaN 40.24
4 Pokemon Red/Pokemon Blue GB 1996 Role-Playing 11.27 8.89 10.22 1.00 NaN NaN NaN 31.38
5 Tetris GB 1989 Puzzle 23.20 2.26 4.22 0.58 NaN NaN NaN 30.26
9 Duck Hunt NES 1984 Shooter 26.93 0.63 0.28 0.47 NaN NaN NaN 28.31
10 Nintendogs DS 2005 Simulation 9.05 10.95 1.93 2.74 NaN NaN NaN 24.67
Out[10]:
year_of_release na_sales eu_sales jp_sales other_sales critic_score user_score total_sales
count 16444.000000 16444.000000 16444.000000 16444.000000 16444.000000 16444.000000 16444.000000 16444.000000
mean 2006.486256 0.264012 0.145930 0.078487 0.047594 33.494405 3.234237 0.536023
std 5.875525 0.818378 0.506716 0.311100 0.188005 35.821558 3.689043 1.558786
min 1980.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2003.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.060000
50% 2007.000000 0.080000 0.020000 0.000000 0.010000 0.000000 0.000000 0.170000
75% 2010.000000 0.240000 0.110000 0.040000 0.030000 70.000000 7.300000 0.470000
max 2016.000000 41.360000 28.960000 10.220000 10.570000 98.000000 9.700000 82.540000

While processing the data, I have defined the relationship between the columns rating, critic, and user score. According to ESRB rating, the publisher of the game must submit an application to the company for its evaluation. I am assuming games that have missing values in rating columns means that their publishers just did not apply for rating. I decided to fill in the missing values in the columns user score, critic score, and rating with 0 so that I can continue to work with these columns. Also, I filled in the values TBD (to be determine) in the column user score to 0. I changed data types in the column year of release to the type int from float (since the year value can be only an integer value) and in the column user score since we will use the values of this column in our research and I need numerical values. Also, I decided to drop missing values in the column year and name, since I see no way to replace missing values with suitable ones, and it's only 1.5% from our dataset. I added the column with total sales which contain the sum of the sales from each region.

Step 3. Analyze the data

3.1. Total sales by years

In [12]:

The graph above shows the total sales of video games by release year. Since 2001, video game sales have grown rapidly and picked at 2008. After 2008 the popularity of video games began to decline, and in 2012-2015 years total sales are approximately $250-350 million per year.

3.2. Sales by platforms

In [13]:
Out[13]:
Text(0.5, 1.0, 'Total sales of video games by platforms, USD million')

This figure shows the total sales of games on each platform. The most popular platform for the whole history of video games is the platform PS2. The highest revenue of video games on this platform reaches more than 1200 million USD. Next, the most popular platforms are X360, PS3, Wii. For my further research, I will take the two most popular platforms: PS2 and X360.

3.3. Research of selected platforms

In [14]:

The histogram shows the total sales of video games per year of the most popular platform - PlayStation2. Video games on the PS2 platform were released in 2000 and from 2001 to 2005 their sales were very high. After 2005 sales began to decline, and in 2011 they stopped altogether. So, games on the PS2 platform have been on the market for 11 years.

In [15]:

The histogram shows the total sales of video games per year of the second popular platform - X360. Video games on the X360 platform were released in 2005 and the majority of the sample is concentrated between 2008 and 2011 years, peaking in 2010 and declining after until 2016 (about 12 years on the market).

In [16]:

I decided to display the most popular platforms of 2015 and compare the total sales on these platforms with total sales of the most profitable platforms to get a clear idea of how the sales of video games have changed (the chart below), and each of the platforms are gaining popularity or on the peak of popularity to decide which games on which platforms will be the best-selling in 2017.

In [17]:

Type Markdown and LaTeX: α2

In [18]:
200020022004200620082010201220142016050100150200
PS2X360PS4XOne3DSTotal sales of selected platformsTotal sales

This graph shows the total sales of the two most popular platforms in the history of video games (PS2 and X360) and of the most popular platforms in 2015. Total sales of the PS2 are growing rapidly, I believe it will also remain the most popular in 2017. Due to the fact that the lifecycle of the platform is about 10-12 years and this number is decreasing, and that the data for 2016 may be incomplete, I propose to consider data from 2011 inclusive. So I identified two platforms with the greatest potential in 2017: PS4 and XOne.

In [19]:

3.4. The global sales of all games, broken down by platform

In [20]:
Out[20]:
4022
In [21]:
In [22]:
Out[22]:
3554

In this step, I'll determine the upper limits of outliers, removed the outliers, and stored them in a separate DataFrame. After deleting the outliers I built the chart which shows us the influence of the platform on total sales. The platforms with the biggest value are X360, PS3, Wii, WiiU, and XOne.

In [23]:
In [24]:
name platform genre rating
count 498 498 498 498
unique 498 1 11 5
top The Testament of Sherlock Holmes X360 Action M
freq 1 498 148 137
In [25]:
0501000505010005101505051015
Scatter matrix correlation by scores and total sales on the platform X360critic_scoreuser_scoretotal_salescritic_scoreuser_scoretotal_sales
The correlation coefficient between user scores and total sales:
0.1781631896945023
The correlation coefficient between critic scores and total sales:
0.31566075211358774

A low correlation between user score and sales suggest that user ratings have little to do with revenue. We see a slightly higher indicator by the ratings of critics and we also see on the graph that games with large sales are highly rated by critics.

Reviewer's comment v. 1:

Yes, we have weak correlation between variables. Please note that correlation function shows only linear dependecy between variables. Maybe this link will be interesting for you: https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/.

3.6. Distribution of games by genre

In [26]:
mean_sales total_sales
genre
Action 0.439559 557.80
Shooter 1.224650 402.91
Role-Playing 0.528237 245.63
Sports 0.609640 237.15
Misc 0.371787 139.42
Platform 0.726885 88.68
Racing 0.490611 88.31
Fighting 0.420755 66.90
Simulation 0.369559 50.26
Adventure 0.109854 45.04
Strategy 0.189402 22.16
Puzzle 0.138732 9.85
Out[26]:
Text(0.5, 1.0, 'Distribution of games by genre and total sales, 2011-2016')

The diagram above shows the distribution of each genre of video games on the market in 2011-2016. The most profitable genre is action with total sales of more than 550 USD million. In the second place the genre Shooter with total sales of about 400 USD million. In the third place - Role-Playing with total sales of 245 USD million. The least profitable genres are Puzzle, Strategy, Adventure.

Conclusion

At this step we founded how much games were released by the years (the peak of sales is 2008) and determined which platform was the most popular for the whole history of video games(PS2 and X360) and what is the most profitable platform (X360). Determined the lifecycle of the platform (about 11 years) and also determined that the lifecycle decreses over the years. I think is due to technological progress, these days, developing a new platform takes much less time than before. Also we defined the two platforms that will be popular in 2017 - XOne and PS4, and period for our investigation due to the lyfecycle of the platforms: 2011. The most popular genres are Action, Shooter and Role-Playing.

Step 4. Create a user profile for each region

4.1. User profile for North American region

In [27]:
total_sales
platform
X360 226.80
PS3 168.26
PS4 108.74
XOne 93.12
3DS 82.65
total_sales
genre
Action 230.92
Shooter 194.43
Sports 108.70
Role-Playing 83.78
Misc 69.69

4.2. User profile for Europe region

In [28]:
total_sales
platform
PS3 163.10
PS4 141.09
X360 117.10
3DS 61.27
PC 56.70
total_sales
genre
Action 199.12
Shooter 148.02
Sports 88.48
Role-Playing 62.30
Racing 40.58

4.3. User profile for Japan region

In [29]:
total_sales
platform
3DS 100.62
PS3 49.39
PSP 24.04
PSV 21.84
PS4 15.96
total_sales
genre
Role-Playing 79.87
Action 63.62
Misc 17.60
Fighting 12.38
Platform 12.13

Top 5 platforms

In [30]:
X360PS3PS4XOne3DSPCPSPPSV050100150200250300350400
JapanEuropeNorth AmericaMarket shares of platforms for each regionTotal sales

On the histogram above we can see the market share of the 5 most popular platforms in each region. We can see the signature difference between the preferences of users from North America, Europe, and especially from Japan. In Japan is not enter into the 5 most sellest platform the most popular platform that is using by the consumers in NA and Europe, and also we see 2 platforms that are not popularly in NA and in Europe: PSP and PSV. And we have 3 platforms that are popular in the all-region: PS3, PS4, and 3DS. The XOne is in the top 5 only in NA and PC is only in Europe.

In [31]:
ActionShooterSportsRole-PlayingMiscRacingFightingPlatform0100200300400500
JapanEuropeNorth AmericaMarket shares of genres for each regionTotal sales

On the graph above we see the top sailing genres in each region (NA, Europe, and Japan). We have 2 genres that are popular in each region - Action and Role-Playing. Gamers in Japan are not interested in genres such as Shooter and Sports, which are in the top 5 most popular genres in Europe and NA, and European gamers rarely choose the Misk genre. Instead of Misk they prefer Racing (I think this is due to the love of Europeans for racing, films about racing, etc.). Japanese players are also interested in Fighting and Platform genres, which are not popular in other regions, and this difference is due to cultural characteristics and mentality of the Japanese.

Conclusion

On this step we defind the preferences of users for each region (North America, Europe, Japan) for the video games platforms and genres. I think this is due to the differences in cultural characteristics and possibly the policy of state itself in relation to the gaming business (for example the platforms PSP and PSV are platforms developed in Japan).

Step 5. Test the hypotheses

5.1. Average user ratings of the Xbox One and PC platforms are the same

The null hypothesis is the average user ratings of the XBox One and PC platforms are the same. The alternative hypotesis is the average user ratings are different.

In [32]:
In [33]:
user_score_xone user_score_pc
year_of_release
2013 5.200000 5.352632
2014 4.860656 5.095745
2015 4.451250 4.910000
2016 5.005747 5.248148
2011 NaN 4.722302
2012 NaN 5.778689
p-value:  0.2147094461125155
We can't reject the null hypothesis

Conclusion

We can't reject the null hypothesis, the average user ratings of the platforms XBox One and PC are the same. As we can see, the type of platform does not have a strong influence on the user rating, the average user rating for the platforms are the same.

Reviewer's comment v. 1:

Yes, there are no statistical significant differences between the average user ratings.

Maybe this link will be interesting for you: https://www.analyticsvidhya.com/blog/2019/09/everything-know-about-p-value-from-scratch-data-science/

5.2. Average user ratings for the Action and Sports genres are different

The null hypothesis is the average user ratings of the genre Action and for the genre Sports are the same. The alternative hypotesis is the average user ratings are different.

In [34]:
user_score_action user_score_sports
year_of_release
2011 3.906303 2.695868
2012 3.432830 4.181481
2013 4.434694 3.292453
2014 3.690957 4.394444
2015 2.963241 3.818644
2016 3.170225 4.187500
20112012201320142015201600.511.522.533.544.5
ActionSports
p-value:  0.6474119638356557
We can't reject the null hypothesis

I set the parameter equal_var to False since we don't accept that the means of the user scores will be the same, it's the two independent populations. It turns that the user ratings for the games of genre action and genre sports are the same. This is logical since we are considering the two most popular genres, that are constantly competing with each other.

General conclusion

We can see the total annual revenue for console games declining. They were at their peak in 2004, now their popularity is falling. My guess is that this is due to the growing market for mobile video games (or another type of video game that is gaining popularity: virtual reality). Regionally, North America is the largest market (percentage of total games sold over the past five years), followed by Japan. The gaming industry is quite specific, and in the graphs above, we see how strongly the region in which the researched user is located affects. the user profile by region shows us that the CC platform is the most popular in Japan, which is not even included in the top 5 most popular user platforms in America and Europe. For greater success, I propose to divide the purchased types of video games depending on the region where the store is located in accordance with the user profiles that we have compiled. Here are the user's profiles for each region and let's take into account mainly those platforms that are at the peak of popularity or are gaining popularity:

  1. North America. NA has the largest share of the consumer video games market. The most popular platforms in this region are X360, PS3, PS4, Xone, and 3DS. But taking into account the research conducted, I propose to focus on sales on the PS4 and XOne platforms. Here is the list of the most popular genres (in descending order): action, sport, shooter, platform, and misk.
  2. Europe. The preferences of European users of video games do not differ much in terms of genres, the only difference is that the 5 most popular genres in Europe include not platform, but racing. As for the type of platform, we also see only one difference with the NA region: the PC, not the XOne.
  3. Japan. For this region, I propose to promote video games on the PS4 platform, continue selling 3DS, and also two platforms that are popular only in Japan: PSP and PSV. The most popular genre in this region is role-playing, and further in descending order: action, misc, fighting, platform.

Taking snapshot - this may take a few seconds

Snapshot succeeded - newplot.png